Cleaning the Data

Here is how we loaded and cleaned the data, and the problems that arose in the process.

R and DuckDB Setup
library(readr)
library(tidyverse)
library(janitor)
library(gt)
library(reticulate)

library(DBI)
library(duckdb)
options(duckdb.enable_rstudio_connection_pane=TRUE)

drv <- duckdb()
con <- dbConnect(drv)
Python Setup
# py_install("pandas")
import pandas as pd
import numpy as np
import great_tables as gt #this makes the python tables look nice
import warnings #this is to silence a warning about using .replace later in the code
from great_tables import GT
warnings.simplefilter(action='ignore', category=FutureWarning)

Read the Tables from csv Files

These are the tables before cleaning and combining. Lots of unnecessary columns and inconsistent headings between the tables. Another issue to address is the quotations around the song titles on the Billboard chart. The song titles and artist names are also not consistent across tables.

Reading the csv files is pretty simple in R. read_csv reads the csv and the output can be saved. show_col_types = FALSE hides the console output that shows a count of the rows and columns and if the column is a chr, dbl, etc.

Spotify
spotify_charts <- read_csv("initial_tables/spotify_top_charts_22.csv", show_col_types = FALSE)
spotify_charts
TikTok
tiktok_charts <- read_csv("initial_tables/TikTok_songs_2022.csv", show_col_types = FALSE)
tiktok_charts
Billboard
billboard_charts <- read_csv("initial_tables/Billboard_hot_100_year_end_2022.csv", show_col_types = FALSE)
billboard_charts

This code makes (or replaces) a table from the csv file. SELECT * means that all columns in the csv will be brought over into the table. The tables only show five rows so that they do not take up too much of the page.

CREATE OR REPLACE TABLE spotify_charts AS
  SELECT * FROM 'initial_tables/spotify_top_charts_22.csv';
SELECT * FROM spotify_charts
LIMIT 5; --only show five rows
5 records
uri artist_names track_name peak_rank weeks_on_chart danceability energy key loudness mode speechiness acousticness instrumentalness liveness tempo time_signature duration_ms
spotify:track:02MWAaffLxlfxAUY7c5dvx Glass Animals Heat Waves 1 65 0.761 0.525 11 -6.900 1 0.0944 0.4400 6.70e-06 0.0921 80.870 4 238805
spotify:track:5PjdY0CKGZdEuoNab3yDmX The Kid LAROI, Justin Bieber STAY (with Justin Bieber) 1 37 0.591 0.764 1 -5.484 1 0.0483 0.0383 0.00e+00 0.1030 169.928 4 141806
spotify:track:3IAfUEeaXRX9s9UdKOJrFI Anitta Envolver 3 3 0.812 0.736 4 -5.421 0 0.0833 0.1520 2.54e-03 0.0914 91.993 4 193806
spotify:track:1HhNoOuqm1a5MXYEgAFl8o Imagine Dragons, JID, Arcane, League of Legends Enemy (with JID) - from the series Arcane League of Legends 3 21 0.728 0.783 11 -4.424 0 0.2660 0.2370 0.00e+00 0.4340 77.011 4 173381
spotify:track:4fouWK6XVHhzl78KzQ1UjL GAYLE abcdefu 1 19 0.695 0.540 4 -5.692 1 0.0493 0.2990 0.00e+00 0.3670 121.932 4 168602
CREATE OR REPLACE TABLE tiktok_charts AS
  SELECT * FROM 'initial_tables/TikTok_songs_2022.csv';
SELECT * FROM tiktok_charts
LIMIT 5; --only show five rows
5 records
track_name artist_name artist_pop album track_pop danceability energy loudness mode key speechiness acousticness instrumentalness liveness valence tempo time_signature duration_ms
Running Up That Hill (A Deal With God) Kate Bush 81 Hounds Of Love 95 0.629 0.547 -13.123 0 10 0.0550 0.7200 3.14e-03 0.0604 0.197 108.375 4 298933
As It Was Harry Styles 91 As It Was 96 0.520 0.731 -5.338 0 6 0.0557 0.3420 1.01e-03 0.3110 0.662 173.930 4 167303
Sunroof Nicky Youre 73 Sunroof 44 0.768 0.716 -5.110 1 10 0.0404 0.3500 0.00e+00 0.1500 0.841 131.430 4 163026
Heat Waves Glass Animals 80 Dreamland (+ Bonus Levels) 89 0.761 0.525 -6.900 1 11 0.0944 0.4400 6.70e-06 0.0921 0.531 80.870 4 238805
About Damn Time Lizzo 81 About Damn Time 92 0.836 0.743 -6.305 0 10 0.0656 0.0995 0.00e+00 0.3350 0.722 108.966 4 191822
CREATE OR REPLACE TABLE billboard_charts AS
  SELECT * FROM 'initial_tables/Billboard_hot_100_year_end_2022.csv';
SELECT * FROM billboard_charts
LIMIT 5; --only show five rows
5 records
No. Title Artist(s)
1 “Heat Waves” Glass Animals
2 “As It Was” Harry Styles
3 “Stay” The Kid Laroi and Justin Bieber
4 “Easy on Me” Adele
5 “Shivers” Ed Sheeran

Making the tables in Python is very similar to R. pd.read_csv reads the csv file and returns a DataFrame.

Billboard
spotify_charts = pd.read_csv("initial_tables/spotify_top_charts_22.csv")
spotify_charts
                                      uri  ... duration_ms
0    spotify:track:02MWAaffLxlfxAUY7c5dvx  ...      238805
1    spotify:track:5PjdY0CKGZdEuoNab3yDmX  ...      141806
2    spotify:track:3IAfUEeaXRX9s9UdKOJrFI  ...      193806
3    spotify:track:1HhNoOuqm1a5MXYEgAFl8o  ...      173381
4    spotify:track:4fouWK6XVHhzl78KzQ1UjL  ...      168602
..                                    ...  ...         ...
641  spotify:track:0So2sgVa8aJiARPl2P29u2  ...      187333
642  spotify:track:1TUuhV75FeOF6UObRsABo2  ...      226975
643  spotify:track:3GbQcjKWT7eVpUfzqwUjxv  ...      180666
644  spotify:track:7My9ca9QEIR8MvIOdIrTWA  ...      215120
645  spotify:track:4axSuOg3BqsowKjRpj59RU  ...      264439

[646 rows x 17 columns]
Billboard
tiktok_charts = pd.read_csv("initial_tables/TikTok_songs_2022.csv")
tiktok_charts
                                 track_name  ... duration_ms
0    Running Up That Hill (A Deal With God)  ...      298933
1                                 As It Was  ...      167303
2                                   Sunroof  ...      163026
3                                Heat Waves  ...      238805
4                           About Damn Time  ...      191822
..                                      ...  ...         ...
258              The Less I Know The Better  ...      216320
259                              Dandelions  ...      233720
260           Jimmy Cooks (feat. 21 Savage)  ...      218365
261                            Good Looking  ...      214800
262                                 INFERNO  ...      133134

[263 rows x 18 columns]
Billboard
billboard_charts = pd.read_csv("initial_tables/Billboard_hot_100_year_end_2022.csv")
billboard_charts
    No.                       Title                        Artist(s)
0     1                "Heat Waves"                    Glass Animals
1     2                 "As It Was"                     Harry Styles
2     3                      "Stay"  The Kid Laroi and Justin Bieber
3     4                "Easy on Me"                            Adele
4     5                   "Shivers"                       Ed Sheeran
..  ...                         ...                              ...
95   96              "Flower Shops"   Ernest featuring Morgan Wallen
96   97               "To the Moon"        Jnr Choi and Sam Tompkins
97   98                    "Unholy"         Sam Smith and Kim Petras
98   99           "One Mississippi"                       Kane Brown
99  100  "Circles Around This Town"                     Maren Morris

[100 rows x 3 columns]

Cleaning

All the tables are messy. Both the Spotify and TikTok tables have columns that are not necessary for this project. To keep the tables simple, we cleaned the data to make all three tables as similar as possible. We renamed all the columns holding the song title to song_title, the artist(s) artist, and the popularity/rank to rank_s/rank_t/rank_b for Spotify, TikTok, and Billboard, respectively. We also included the tempo column on the Spotify and TikTok tables. The quotations are also removed from the song titles in the Billboard chart as well.

Spotify

For the spotify_charts, we filterd to keep on the the columns containing the song title, artist anme, peak rank, and tempo. We reordered and renamed the columns as follows: song_title, artist, rank_s, and tempo. The rank_s column indicated the highest rank the song appeared on the chart, so a lower number indicated that the song was more popular.

For the Spotify data, we used select() to only keep the relevant columns.

spotify_cleaned <- spotify_charts |> 
  clean_names() |> #clean the column headers
  select(track_name, artist_names, peak_rank, tempo) |> #select only certain columns
  rename(song_title = track_name, artist = artist_names, rank_s = peak_rank) |>
  arrange(rank_s) #order from most to least popular (lower number = more popular)

spotify_cleaned

Similar to in R, we used SELECT to only keep the relevant columns and used ORDER BY to arrange the rows from most to least popular.

--select only relevant columns and rename them appropriately
CREATE OR REPLACE TABLE spotify_cleaned AS
  SELECT track_name AS song_title, artist_names AS artist, peak_rank AS rank_s, tempo
  FROM spotify_charts
;
--order from most to least popular using the rank_s column (lower number = more popular)
FROM spotify_cleaned
ORDER BY rank_s ASC
Displaying records 1 - 10
song_title artist rank_s tempo
Heat Waves Glass Animals 1 80.870
STAY (with Justin Bieber) The Kid LAROI, Justin Bieber 1 169.928
abcdefu GAYLE 1 121.932
Easy On Me Adele 1 141.981
good 4 u Olivia Rodrigo 1 166.928
Blinding Lights The Weeknd 1 171.005
MONTERO (Call Me By Your Name) Lil Nas X 1 178.781
drivers license Olivia Rodrigo 1 143.875
DÁKITI Bad Bunny, Jhay Cortez 1 109.928
Beggin’ Måneskin 1 134.002

For the Spotify data, we used .filter to only keep the relevant columns and .sort_values to sort the rows.

spotify_cleaned = (spotify_charts
                   .filter(items = ['track_name', 'artist_names', 'peak_rank', 'tempo'])
                   .rename(columns = {'track_name': 'song_title',
                                      'artist_names': 'artist',
                                      'peak_rank': 'rank_s'})
                   .sort_values('rank_s') # lower number = higher rank
                  )
spotify_cleaned
                                        song_title  ...    tempo
0                                       Heat Waves  ...   80.870
1                        STAY (with Justin Bieber)  ...  169.928
4                                          abcdefu  ...  121.932
87          Peaches (feat. Daniel Caesar & Giveon)  ...   90.030
93   Sunflower - Spider-Man: Into the Spider-Verse  ...   89.960
..                                             ...  ...      ...
500            Hold That Heat (feat. Travis Scott)  ...  130.045
561                         Si Estuviésemos Juntos  ...  171.854
310                                      Heartless  ...   87.999
583                                 Paris to Tokyo  ...  146.733
609                                      Sehnsucht  ...  142.090

[646 rows x 4 columns]

TikTok

Similar to the Spotify dataset, we kept only the relevant columns and renamed them to be the same as the Spotify dataset. The rank_t column indicates the popularity of a song, with a higher value indicating that the song achieved greater popularity.

tiktok_cleaned <- tiktok_charts |>
  clean_names() |> #clean the column headers
  select(track_name, artist_name, track_pop, tempo) |> #filter for certain columns
  rename(song_title = track_name, artist = artist_name, rank_t = track_pop) |>
  arrange(desc(rank_t)) #arrange by popularity (higher number = more popular)

tiktok_cleaned
--create a cleaned version of the tiktok table and select only relevant columns
CREATE OR REPLACE TABLE tiktok_cleaned AS
  SELECT track_name AS song_title, artist_name AS artist, track_pop AS rank_t, tempo
  FROM tiktok_charts
;
--order from most to least popular using the rank_t column
FROM tiktok_cleaned
ORDER BY rank_t DESC
Displaying records 1 - 10
song_title artist rank_t tempo
Glimpse of Us Joji 97 169.914
As It Was Harry Styles 96 173.930
Running Up That Hill (A Deal With God) Kate Bush 95 108.375
Late Night Talking Harry Styles 93 114.996
About Damn Time Lizzo 92 108.966
Jimmy Cooks (feat. 21 Savage) Drake 92 165.921
MIDDLE OF THE NIGHT Elley Duhé 91 185.727
Until I Found You Stephen Sanchez 90 101.358
Blinding Lights The Weeknd 90 171.005
Sweater Weather The Neighbourhood 90 124.053
tiktok_cleaned = (tiktok_charts
                   .filter(items = ['track_name', 'artist_name', 'track_pop', 'tempo'])
                   .rename(columns = {'track_name': 'song_title',
                                      'artist_name': 'artist',
                                      'track_pop': 'rank_t'})
                   .sort_values('rank_t', ascending=False) # higher numbers = more popular
                  )
                  
tiktok_cleaned
                                 song_title             artist  rank_t    tempo
5                             Glimpse of Us               Joji      97  169.914
1                                 As It Was       Harry Styles      96  173.930
0    Running Up That Hill (A Deal With God)          Kate Bush      95  108.375
52                       Late Night Talking       Harry Styles      93  114.996
260           Jimmy Cooks (feat. 21 Savage)              Drake      92  165.921
..                                      ...                ...     ...      ...
131                                 What If              HOURS       0  126.001
184                            Backyard Boy  Claire Rosinkranz       0  138.026
130                               Slow Down             Coopex       0  128.048
42                              Without You       John De Sohn       0  124.968
200                      Right Here Waiting          Boostereo       0  115.052

[263 rows x 4 columns]

Billboard

The billboard_chart was almost the same as the previous two tables. The columsn kept are the columns containing the song title, artist, and rank. The columns were named song_title, artist, and rank_b. The rank_b column indicated the rank of the song, with a lower number indicating a song that achieved a higher rank.

billboard_cleaned <- billboard_charts |>
  clean_names() |> #clean the column headers
  select(title, artist_s, no) |> #reorder the columns
  rename(song_title = title, rank_b = no, artist = artist_s) |> 
  mutate(song_title = str_replace_all(song_title, '"', '')) #take quotations off song titles

billboard_cleaned
--create a cleaned version of the billboard table and select only renamed relevant columns
CREATE OR REPLACE TABLE billboard_cleaned AS
  SELECT replace(title, '"', '') AS song_title, "artist(s)" AS artist, "no." AS rank_b
  FROM billboard_charts
;
--order from most to least popular using the rank_b column
FROM billboard_cleaned
ORDER BY rank_b
Displaying records 1 - 10
song_title artist rank_b
Heat Waves Glass Animals 1
As It Was Harry Styles 2
Stay The Kid Laroi and Justin Bieber 3
Easy on Me Adele 4
Shivers Ed Sheeran 5
First Class Jack Harlow 6
Big Energy Latto 7
Ghost Justin Bieber 8
Super Gremlin Kodak Black 9
Cold Heart (Pnau Remix) Elton John and Dua Lipa 10
billboard_cleaned = (billboard_charts
                   .rename(columns = {'Title': 'song_title',
                                      'Artist(s)': 'artist',
                                      'No.': 'rank_b'})
                   .sort_values('rank_b') # highest to lowest rank
                  )
billboard_cleaned['song_title'] = billboard_cleaned['song_title'].str.replace('"', '')

billboard_cleaned
    rank_b                song_title                           artist
0        1                Heat Waves                    Glass Animals
1        2                 As It Was                     Harry Styles
2        3                      Stay  The Kid Laroi and Justin Bieber
3        4                Easy on Me                            Adele
4        5                   Shivers                       Ed Sheeran
..     ...                       ...                              ...
95      96              Flower Shops   Ernest featuring Morgan Wallen
96      97               To the Moon        Jnr Choi and Sam Tompkins
97      98                    Unholy         Sam Smith and Kim Petras
98      99           One Mississippi                       Kane Brown
99     100  Circles Around This Town                     Maren Morris

[100 rows x 3 columns]

Challenges

Two challenges stood out to us, both involving the variations in the way that the song metadata is recorded on the charts. The same song can appear on different albums with slgiht variations of the title, or can have a featured artist listed in the title or as an additional artist, or could be a cover of another song, etc. As an example, the song “Enemy” by Imagine Dragons appears twice on the Spotify chart, once on the TikTok chart, and once on the Billboard chart:

Show the code
spotify_cleaned |>
  filter(str_detect(song_title, 'Enemy')) #filter for song titles containing "Enemy"
Show the code
tiktok_cleaned |>
  filter(str_detect(song_title, 'Enemy'))
Show the code
billboard_cleaned |>
  filter(str_detect(song_title, 'Enemy'))
Spotify
SELECT *
FROM spotify_cleaned
WHERE song_title LIKE '%Enemy%'
2 records
song_title artist rank_s tempo
Enemy (with JID) - from the series Arcane League of Legends Imagine Dragons, JID, Arcane, League of Legends 3 77.011
Enemy - From the series Arcane League of Legends Imagine Dragons, Arcane, League of Legends 171 77.029
TikTok
SELECT *
FROM tiktok_cleaned
WHERE song_title LIKE '%Enemy%'
1 records
song_title artist rank_t tempo
Enemy (with JID) - from the series Arcane League of Legends Imagine Dragons 87 77.011
Billboard
SELECT *
FROM billboard_cleaned
WHERE song_title LIKE '%Enemy%'
1 records
song_title artist rank_b
Enemy Imagine Dragons and JID 15
Show the code
GT(spotify_cleaned.query('song_title.str.contains("Enemy")', engine='python'))
song_title artist rank_s tempo
Enemy (with JID) - from the series Arcane League of Legends Imagine Dragons, JID, Arcane, League of Legends 3 77.011
Enemy - From the series Arcane League of Legends Imagine Dragons, Arcane, League of Legends 171 77.029
Show the code
GT(tiktok_cleaned.query('song_title.str.contains("Enemy")', engine='python'))
song_title artist rank_t tempo
Enemy (with JID) - from the series Arcane League of Legends Imagine Dragons 87 77.011
Show the code
GT(billboard_cleaned.query('song_title.str.contains("Enemy")', engine='python'))
rank_b song_title artist
15 Enemy Imagine Dragons and JID

All three charts record the song title differently. The Spotify chart has two different versions of “Enemy” on it: one with JID and one without, so leaving those titles as is is fine. The TikTok and Billboard charts both have the version with JID, but the name of the song and artists listed are different. To handle these synonyms, we created a lookup table and applied it to each dataset. We chose to apply the dictionary to each individual table rather than a combined table so that when we combine the tables, songs with the same title and artists would be combined. The tables for the song titles and artist names were created in Google Sheets and uploaded to RStudio as csv files.

Change Song Titles

The process in the three languages was essentially the same. We read the csv file and stored the resulting table. Then we merged/joined the lookup table with each chart. The lookup table has two columns: one containing the canonical names and another containing the alternative names.

Read the lookup table.

Show the code
song_title_lookup <- read_csv("initial_tables/dict_song_titles.csv", show_col_types = FALSE)
song_title_lookup

We used a left_join and joined on the song_title column of the song chart on the alt_name column of the lookup table. This join created a new column called canonical_name on the chart. mutate is used to replace the value in the song_title columns with the name in the canonical_name, if there is one. coalesce takes the first non-null value, so if the canonical_name column is empty, then the song title will not be replaced.

spotify_cleaned <- spotify_cleaned |>
  left_join(song_title_lookup, by = join_by(song_title == alt_name)) |>
  mutate(song_title = coalesce(canonical_name, song_title)) |> #coalesce takes the first non-null value
  select(song_title, artist, rank_s, tempo) #keep only relevant columns
tiktok_cleaned <- tiktok_cleaned |>
  left_join(song_title_lookup, by = join_by(song_title == alt_name)) |>
  mutate(song_title = coalesce(canonical_name, song_title)) |>
  select(song_title, artist, rank_t, tempo)
billboard_cleaned <- billboard_cleaned |>
  left_join(song_title_lookup, by = join_by(song_title == alt_name)) |> 
  mutate(song_title = coalesce(canonical_name, song_title)) |>
  select(song_title, artist, rank_b)

Read the look table csv and create a table.

CREATE OR REPLACE TABLE song_title_lookup AS
  SELECT * 
  FROM 'initial_tables/dict_song_titles.csv';

We used UPDATE to avoid having to coalesce columns. This code checks if a song listed in the song title column in the chart is the same as a song listed in the alt_name column of the lookup table. If there is a match, then the song title is replaced with the song from the canonical_name column of the lookup table.

UPDATE spotify_cleaned
  SET song_title = song_title_lookup.canonical_name
  FROM song_title_lookup
  WHERE spotify_cleaned.song_title = song_title_lookup.alt_name
UPDATE tiktok_cleaned
  SET song_title = song_title_lookup.canonical_name
  FROM song_title_lookup
  WHERE tiktok_cleaned.song_title = song_title_lookup.alt_name
UPDATE billboard_cleaned
  SET song_title = song_title_lookup.canonical_name
  FROM song_title_lookup
  WHERE billboard_cleaned.song_title = song_title_lookup.alt_name

We have two different methods in Python here. This method uses .merge and is similar to the R version. The lookup table csv file is read. Then the chart table and the lookup table are left-merged on the song_title and alt_name. This creates two new columns in the chart called canonical_name and alt_name. If there is a value in the alt_name columns, the the value from the canonical_name column replaces the song_title.

song_syn = pd.read_csv('initial_tables/dict_song_titles.csv')

# Spotify
spotify_cleaned = spotify_cleaned.merge(song_syn, left_on = 'song_title', right_on = 'alt_name', how = 'left')

spotify_cleaned['song_title'] = np.where( #if the alt_name column has a value then, replace song_title with canonical_name
   spotify_cleaned['alt_name'].notna(), 
   spotify_cleaned['canonical_name'], 
   spotify_cleaned['song_title'])
   
spotify_cleaned = spotify_cleaned.drop(columns = ['canonical_name', 'alt_name']) # remove unnecessary columns


# TikTok
tiktok_cleaned = tiktok_cleaned.merge(song_syn, left_on = 'song_title', right_on = 'alt_name', how = 'left')

tiktok_cleaned['song_title'] = np.where(
   tiktok_cleaned['alt_name'].notna(), 
   tiktok_cleaned['canonical_name'], 
   tiktok_cleaned['song_title'])
   
tiktok_cleaned = tiktok_cleaned.drop(columns = ['canonical_name', 'alt_name'])


# Billboard
billboard_cleaned = billboard_cleaned.merge(song_syn, left_on = 'song_title', right_on = 'alt_name', how = 'left')

billboard_cleaned['song_title'] = np.where(
   billboard_cleaned['alt_name'].notna(), 
   billboard_cleaned['canonical_name'], 
   billboard_cleaned['song_title'])
   
billboard_cleaned = billboard_cleaned.drop(columns = ['canonical_name', 'alt_name'])

The second Python method uses .replace. This method was my initial inclination regarding how to recode the charts, but is not the best appraoch to this problem. This method works be converting the lookup table csv file into a dictionary of keys and values. The alt_name column is set as the index and is mapped to the canonical_name column. to_dict converts the result to a dictionary.

.replace looks for the song title in the keys of the dictionary and replaces the song_title with the canonical name from the dictionary. inplace modifies the original DataFrame directly. Using it spits out a “future warning” because there are planned changes for inplace.

# import the table and convert to a dictionary using to.dict()
song_title_dict = pd.read_csv('initial_tables/dict_song_titles.csv', header=0).set_index('alt_name')['canonical_name'].to_dict()

# replace the song_title with the canonical_name
spotify_cleaned['song_title'].replace(song_title_dict, inplace = True)
tiktok_cleaned['song_title'].replace(song_title_dict, inplace = True)
billboard_cleaned['song_title'].replace(song_title_dict, inplace = True)

Change Artist Names

The we did the exact same thing but for the artist names. The lookup table in this instance had two columns: canonical_name and alt_name. The process was essentially the same as changing the song titles.

dict_artist <- read_csv("initial_tables/dict_artist.csv", show_col_types = FALSE)
dict_artist
spotify_cleaned <- spotify_cleaned |>
  left_join(dict_artist, by = join_by(artist == alt_name)) |>
  mutate(song_title = coalesce(canonical_name, song_title)) |>
  select(song_title, artist, rank_s, tempo)
tiktok_cleaned <- tiktok_cleaned |>
  left_join(dict_artist, by = join_by(artist == alt_name)) |>
  mutate(song_title = coalesce(canonical_name, song_title)) |>
  select(song_title, artist, rank_t, tempo)
billboard_cleaned <- billboard_cleaned |>
  left_join(dict_artist, by = join_by(artist == alt_name)) |>
  mutate(song_title = coalesce(canonical_name, song_title)) |>
  select(song_title, artist, rank_b)
CREATE TABLE dict_artist AS
  SELECT * FROM 'initial_tables/dict_artist.csv';
UPDATE spotify_cleaned
  SET artist = dict_artist.canonical_name
  FROM dict_artist
  WHERE spotify_cleaned.artist = dict_artist.alt_name
UPDATE tiktok_cleaned
  SET artist = dict_artist.canonical_name
  FROM dict_artist
  WHERE tiktok_cleaned.artist = dict_artist.alt_name
UPDATE billboard_cleaned 
  SET artist = dict_artist.canonical_name
  FROM dict_artist
  WHERE billboard_cleaned.artist = dict_artist.alt_name
dict_artist = pd.read_csv('initial_tables/dict_artist.csv')

# Spotify
spotify_cleaned = spotify_cleaned.merge(dict_artist, left_on = 'artist', right_on = 'alt_name', how = 'left')
spotify_cleaned['artist'] = np.where( #if the alt_name column has a value then, replace artist with canonical_name
   spotify_cleaned['alt_name'].notna(), 
   spotify_cleaned['canonical_name'], 
   spotify_cleaned['artist'])
spotify_cleaned = spotify_cleaned.drop(columns = ['canonical_name', 'alt_name'])

# TikTok
tiktok_cleaned = tiktok_cleaned.merge(dict_artist, left_on = 'artist', right_on = 'alt_name', how = 'left')
tiktok_cleaned['artist'] = np.where(
   tiktok_cleaned['alt_name'].notna(), 
   tiktok_cleaned['canonical_name'], 
   tiktok_cleaned['artist'])
tiktok_cleaned = tiktok_cleaned.drop(columns = ['canonical_name', 'alt_name'])

# Billboard
billboard_cleaned = billboard_cleaned.merge(dict_artist, left_on = 'artist', right_on = 'alt_name', how = 'left')
billboard_cleaned['artist'] = np.where(
   billboard_cleaned['alt_name'].notna(), 
   billboard_cleaned['canonical_name'], 
   billboard_cleaned['artist'])
billboard_cleaned = billboard_cleaned.drop(columns = ['canonical_name', 'alt_name'])
dict_artist = pd.read_csv('initial_tables/dict_artist.csv', header=0).set_index('alt_name')['canonical_name'].to_dict()

spotify_cleaned['artist'].replace(dict_artist, inplace = True)
tiktok_cleaned['artist'].replace(dict_artist, inplace = True)
billboard_cleaned['artist'].replace(dict_artist, inplace = True)

Examples of the Changes

We did not print the tables above as they would have looked nearly identical to the original tables. Only a few songs out of about 900 songs were changed, so the changes would be difficult to see if the entire table was printed. Here is the same example from above using “Enemy.”

Show the code
spotify_cleaned |>
  filter(str_detect(song_title, 'Enemy')) #filter for song titles containing "Enemy"
Show the code
tiktok_cleaned |>
  filter(str_detect(song_title, 'Enemy'))
Show the code
billboard_cleaned |>
  filter(str_detect(song_title, 'Enemy'))
Spotify
SELECT *
FROM spotify_cleaned
WHERE song_title LIKE '%Enemy%'
2 records
song_title artist rank_s tempo
Enemy (with JID) - from the series Arcane League of Legends Imagine Dragons, JID, Arcane, League of Legends 3 77.011
Enemy - From the series Arcane League of Legends Imagine Dragons, Arcane, League of Legends 171 77.029
TikTok
SELECT *
FROM tiktok_cleaned
WHERE song_title LIKE '%Enemy%'
1 records
song_title artist rank_t tempo
Enemy (with JID) - from the series Arcane League of Legends Imagine Dragons 87 77.011
Billboard
SELECT *
FROM billboard_cleaned
WHERE song_title LIKE '%Enemy%'
1 records
song_title artist rank_b
Enemy (with JID) - from the series Arcane League of Legends Imagine Dragons and JID 15
Show the code
GT(spotify_cleaned.query('song_title.str.contains("Enemy")', engine='python'))
song_title artist rank_s tempo
Enemy (with JID) - from the series Arcane League of Legends Imagine Dragons, JID, Arcane, League of Legends 3 77.011
Enemy - From the series Arcane League of Legends Imagine Dragons, Arcane, League of Legends 171 77.029
Show the code
GT(tiktok_cleaned.query('song_title.str.contains("Enemy")', engine='python'))
song_title artist rank_t tempo
Enemy (with JID) - from the series Arcane League of Legends Imagine Dragons 87 77.011
Show the code
GT(billboard_cleaned.query('song_title.str.contains("Enemy")', engine='python'))
rank_b song_title artist
15 Enemy (with JID) - from the series Arcane League of Legends Imagine Dragons and JID